Communication-Minimal Partitioning of Parallel Loops and Data Arrays for Cache-Coherent Distributed-Memory Multiprocessors

نویسندگان

  • Rajeev Barua
  • David A. Kranz
  • Anant Agarwal
چکیده

Harnessing the full performance potential of cache-coherent distributed shared memory multiprocessors without inordinate user effort requires a compilation technology that can automatically manage multiple levels of memory hierarchy. This paper describes a working compiler for such machines that automatically partitions loops and data arrays to optimize locality of access. The compiler implements a solution to the problem of finding communication-minimal partitions of loops and data. Loop and data partitions specify the distribution of loop iterations and array data across processors. A good loop partition maximizes the cache hit rate while a good data partition minimizes remote cache misses. The problems of finding loop and data partitions interact when multiple loops access arrays with differing reference patterns. Our algorithm handles programs with multiple nested parallel loops accessingmany arrays with array access indices being general affine functions of loop variables. It discovers communication-minimal partitions when communication-free partitions do not exist. The compiler also uses sub-blocking to handle finite cache sizes. A cost model that estimates the cost of a loop and data partition given machine parameters such as cache, local and remote access timings, is presented. Minimizing the cost as estimated by our model is an NP-complete problem, as is the fully general problem of partitioning. A heuristic method which provides good approximate solutions in polynomial time is presented. The loop and data partitioning algorithm has been implemented in the compiler for the MIT Alewife machine. The paper presents results obtained from a working compiler on a 16-processor machine for three real applications: Tomcatv, Erlebacher, and Conduct. Our results demonstrate that combined optimization of loops and data can result in improvements in runtime by nearly a factor of two over optimization of loops alone.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors

This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency tra c on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication tra c, the problem of deriving the optimal tiling parameters for minimal communication in loops with general a ne index expres...

متن کامل

Executing Nested Parallel Loops on Shared-Memory Multiprocessors

Cache-coherent, bus-based shared-memory multiprocessors are a cost-e ective platform for parallel processing. In scienti c parallel applications, most of the computation involves processing of large multidimensional data structures which results in a high degree of data parallelism. This parallelism can be exploited in the form of nested parallel loops. Most existing shared memory multiprocesso...

متن کامل

Integrating Fine-Grained Message Passing in Cache Coherent Shared Memory Multiprocessors

This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished by using a multiprocessor software pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronizatio...

متن کامل

EXECUTING NESTED PARALLEL LOOPS ON SHARED - MEMORYMULTIPROCESSORSSadun

Cache-coherent, bus-based shared-memory multiprocessors are a cost-eeective platform for parallel processing. In scientiic parallel applications, most of the computation involves processing of large multidimensional data structures which results in a high degree of data parallelism. This parallelism can be exploited in the form of nested parallel loops. Most existing shared memory multiprocesso...

متن کامل

Partitioning Regular Applications for Cache-coherent Multiprocessors

In all massively parallel systems (MPPs), whether message-passing or shared-address space, the memory is physically distributed for scalability and the latency of accessing remote data is orders of magnitude higher than the processor cycle time. Therefore, the programmer/compiler must not only identify parallelism but also specify the distribution of data among the processor memories in order t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1996